9/24/2020

Agenda

  • Why not linear regression?
  • Logistic regression
  • Multinomial logistic regression
  • Evaluating accuracy
  • Bayes theorem, LDA, QDA
  • k Nearest Neighbours

Recap

What we saw last time

  • Logistic regression and how to interpret it
  • Types of error: false positives, false negatives
  • How to evaluate accuracy: ROC curve

Logistic Regression

You can think of logistic regression as a parametric model for \[\mathbb{P}(Y = y \mid \mathbf{X} = \mathbf{x})\] just as linear regression is a parametric model for \[\mathbb{E}(Y \mid \mathbf{X} = \mathbf{x})\]

The binary logistic regression, in particular, models directly the conditional probability of \(Y = 1 \mid \mathbf{X}\), using \[\mathbb{P}(Y_{i} = 1 \mid \mathbf{X}_{i} = \mathbf{x}_{i}) = F(\beta_{0} + \beta_{1} x_{i1} + \dots + \beta_{p} x_{ip}).\]
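As a concrete illustration, here is a minimal Python sketch of this model; the coefficient vector and the new observation below are made-up values, not the output of any fitted model.

```python
import numpy as np

def sigmoid(t):
    # Logistic function F(t) = e^t / (1 + e^t)
    return 1.0 / (1.0 + np.exp(-t))

# Hypothetical coefficients (beta_0, beta_1, beta_2) for p = 2 predictors
beta = np.array([-1.5, 0.8, -0.3])

# A new observation x = (x_1, x_2)
x = np.array([2.0, 1.0])

# P(Y = 1 | X = x) = F(beta_0 + beta_1 x_1 + beta_2 x_2)
p_hat = sigmoid(beta[0] + beta[1:] @ x)
print(p_hat)          # conditional probability of class 1
print(p_hat > 0.5)    # classify as 1 if the probability exceeds 0.5
```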

Conditional Probability and Classification

  • We can use \(\mathbb{P}(Y = y \mid \mathbf{X} = \mathbf{x})\) to classify \(Y\) given a new value of \(\mathbf{X}\). The most obvious thing to do is to predict the \(y\) that has the highest probability.
  • If there are only two possible outcomes for \(Y\), this amounts to picking the outcome with \(\mathbb{P}(Y = y \mid \mathbf{X} = \mathbf{x}) > 0.5\), as we did for binary logistic regression.
  • Given \(\mathbf{X} = \mathbf{x}^\star\), we can predict \(Y\) to be \(y^\star\), where \[\mathbb{P}(Y = y^\star \mid \mathbf{X} = \mathbf{x}^\star) = \max_y \ \mathbb{P}(Y = y \mid \mathbf{X} = \mathbf{x}^\star)\] (see the sketch below)
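A minimal sketch of this argmax rule, using made-up posterior probabilities for three hypothetical classes:

```python
import numpy as np

# Hypothetical posterior probabilities P(Y = y | X = x*) for three classes
classes = np.array(["A", "B", "C"])
posterior = np.array([0.2, 0.5, 0.3])   # must sum to 1

# Predict the class with the highest conditional probability
y_star = classes[np.argmax(posterior)]
print(y_star)   # -> "B"
```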

Conditional Probability and Classification

  • The logistic regression models directly the conditional probability \(\mathbb{P}(Y = y \mid \mathbf{X} = \mathbf{x})\)
  • A number of classification techniques work by specifying the marginal distribution of \(Y\), the conditional distribution of \(\mathbf{X}\) given \(Y\) and then using Bayes Theorem to compute the conditional distribution of \(Y\) given \(\mathbf{X}\)
  • When we use normal (Gaussian) distributions for each class, this leads to linear or quadratic discriminant analysis (LDA and QDA)

Bayes theorem

Bayes Theorem simply says that if I give you \(p(y)\) and \(p(x \mid y)\), you can compute \(p(y \mid x)\). Remember the basic rules of probability?

\[\begin{aligned} \mathbb{P}(Y = k \mid X = x) &= \dfrac{\mathbb{P}(X = x \text{ and } Y = k)}{\mathbb{P}(X = x)} \\ &= \dfrac{\mathbb{P}(X = x \mid Y = k) \mathbb{P}(Y = k)}{\sum_{l=1}^K \mathbb{P}(X = x \mid Y = l) \mathbb{P}(Y = l)} \\ &= \dfrac{\pi_k f_k(x)}{\sum_{l=1}^K \pi_l f_l(x)} \end{aligned}\]

Bayes theorem

\[\underbrace{p_k(x)}_\text{posterior} = \mathbb{P}(Y = k \mid X = x) = \dfrac{\pi_k f_k(x)}{\sum_{l=1}^K \pi_l f_l(x)} \propto \underbrace{\pi_k}_\text{prior} \cdot \underbrace{f_k(x)}_\text{likelihood}\]

  • \(f_k(x)\) is the likelihood for \(X\) in class \(k\) (“how are the features distributed under each of the classes?”). We will use normal densities for these, separately in each class
  • \(\pi_k\) is the prior probability for class \(k\): it is how likely you think \(Y = k\) is before you see \(X = x\)
  • \(p_k(x) = \mathbb{P}(Y = k \mid X = x)\) is called the posterior probability that \(Y = k\): it is how likely you think \(Y = k\) is after you see \(X = x\)

Posterior \(\propto\) prior \(\times\) likelihood

Bayes theorem (example)

Events:

  • A: have Covid
  • B: test positive for Covid

Goal: find \(\mathbb{P}(A \mid B)\), i.e. the probability of having Covid given a positive test, using

  • \(\mathbb{P}(B \mid A) = 0.95\): probability of positive test when having Covid (true positive rate)
  • \(\mathbb{P}(A) = 0.01\): probability of having Covid (prevalence in the population)
  • \(\mathbb{P}(B \mid \sim A) = 0.01\): probability of positive test when NOT having Covid (false positive rate)

Bayes theorem (example)

\[\begin{aligned} \mathbb{P}(A \mid B) &= \dfrac{\mathbb{P}(B \mid A) \mathbb{P}(A)}{\mathbb{P}(B \mid A) \mathbb{P}(A) + \mathbb{P}(B \mid \sim A) \mathbb{P}(\sim A)} \\ &= \dfrac{0.95 \cdot 0.01}{0.95 \cdot 0.01 + 0.01 \cdot 0.99} \\ &\approx 0.49 \end{aligned}\]
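A quick numeric check of this calculation in Python, using the three probabilities stated on the previous slide:

```python
# Numeric check of the Covid example above
p_B_given_A = 0.95      # P(B | A): true positive rate
p_A = 0.01              # P(A): prevalence
p_B_given_notA = 0.01   # P(B | ~A): false positive rate

# Bayes theorem: P(A | B) = P(B|A) P(A) / [P(B|A) P(A) + P(B|~A) P(~A)]
p_A_given_B = (p_B_given_A * p_A) / (p_B_given_A * p_A + p_B_given_notA * (1 - p_A))
print(round(p_A_given_B, 4))   # ~0.4897, i.e. roughly 0.49
```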

Multiclass classification

Discriminant Analysis

  • The approach is to model the distribution of \(X\) in each of the classes separately, and then use Bayes theorem to flip things around and obtain \(\mathbb{P}(Y \mid X)\)
  • When we use normal (Gaussian) distributions for each class, this leads to linear or quadratic discriminant analysis
  • However, this approach is quite general, and other distributions can be used as well

LDA, QDA, Naive Bayes

Linear Discriminant Analysis, \(p = 1\)

  • We classify a new point according to which density is the highest
  • When the priors are different, we take them into account as well, and compare \(\pi_k f_k(x)\)
  • On the right, we upweight the purple class: the decision boundary shifts to the right

Step 1

Estimate the Gaussian distributions \(f_0(x), f_1(x)\)

Step 2

Incorporate prior belief \(\pi_0 f_0(x), \pi_1 f_1(x)\)

Step 3

Normalize to get the posteriors \(p_0(x), p_1(x)\)
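Putting the three steps together, a minimal Python sketch with made-up class means, a shared variance, and unequal priors (all values below are illustrative, not estimated from data):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical parameters for p = 1 (means differ, variance shared as in LDA)
mu_0, mu_1 = -1.0, 2.0
sigma = 1.5
pi_0, pi_1 = 0.7, 0.3     # prior class probabilities (upweighting class 0 here)

x = 0.4                   # the point we want to classify

# Step 1: class-conditional Gaussian densities f_0(x), f_1(x)
f0 = norm.pdf(x, loc=mu_0, scale=sigma)
f1 = norm.pdf(x, loc=mu_1, scale=sigma)

# Step 2: incorporate the priors, pi_k * f_k(x)
unnorm = np.array([pi_0 * f0, pi_1 * f1])

# Step 3: normalize to obtain the posteriors p_0(x), p_1(x)
posterior = unnorm / unnorm.sum()
print(posterior)              # the two posteriors sum to 1
print(posterior.argmax())     # predicted class
```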

The math behind it

The Gaussian density has the form

\[f_k(x) = \dfrac{1}{\sqrt{2 \pi \sigma_k^2}} e^{- \frac{1}{2} \left( \frac{x - \mu_k}{\sigma_k} \right)^2}\]

Here \(\mu_k\) is the mean, and \(\sigma_k^2\) the variance, for the class \(k\).

LDA assumes that all the \(\sigma_k^2\) are the same, i.e. \(\sigma_k^2 = \sigma^2 \ \forall k\). Plugging this into Bayes formula, we get \[\mathbb{P}(Y = k \mid X = x) \propto \pi_k \dfrac{1}{\sqrt{2 \pi \sigma^2}} e^{- \frac{1}{2} \left( \frac{x - \mu_k}{\sigma} \right)^2}\]

The math behind it

To classify at the value \(X = x\), we need to see which of the curves is largest. Taking logs, and discarding terms that do not depend on \(k\), we see that this is equivalent to assigning \(x\) to the class with the largest discriminant score

\[\delta_k(x) = x \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2 \sigma^2} + \log(\pi_k)\]

Note that \(\delta_k(x)\) is a linear function of \(x\). If there are \(K = 2\) classes and \(\pi_0 = \pi_1 = 0.5\), then one can see that the decision boundary is at \(x = \dfrac{\mu_0 + \mu_1}{2}\)
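A small sketch of this decision rule with made-up parameters; with equal priors the two discriminant scores cross exactly at the midpoint of the means:

```python
import numpy as np

# Hypothetical parameters for K = 2 classes with a shared variance (LDA assumption)
mu = np.array([-1.0, 2.0])        # mu_0, mu_1
sigma2 = 1.5 ** 2                 # common variance sigma^2
pi = np.array([0.5, 0.5])         # equal priors

def delta(x, k):
    # Discriminant score: delta_k(x) = x mu_k / sigma^2 - mu_k^2 / (2 sigma^2) + log(pi_k)
    return x * mu[k] / sigma2 - mu[k] ** 2 / (2 * sigma2) + np.log(pi[k])

x = 0.4
scores = np.array([delta(x, 0), delta(x, 1)])
print(scores.argmax())            # assign x to the class with the largest score

# With equal priors, the scores are equal at the midpoint (mu_0 + mu_1) / 2 = 0.5
boundary = (mu[0] + mu[1]) / 2
print(np.isclose(delta(boundary, 0), delta(boundary, 1)))   # True
```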

The math behind it

Typically we do not know the parameters for the normal distributions under each class, but we have training data. We can simply estimate the parameters and plug them into the decision rule.

The prior terms can be estimated via the sample proportions: \[\hat{\pi}_k = \dfrac{n_k}{n}\] We can estimate mean and variances with the corresponding sample statistics.

\[\begin{aligned} &\hat{\mu}_k = \dfrac{1}{n_k} \sum_{i : y_i = k} x_i \\ &\hat{\sigma}_{k}^{2} = \dfrac{1}{n_{k} - 1} \sum_{i : y_i = k} (x_i - \hat{\mu}_k)^2 \end{aligned}\]

Since LDA assumes a common variance, we need a single estimate: \(\hat{\sigma}^2\) below is known as the pooled variance \[\hat{\sigma}^2 = \dfrac{1}{n - K} \sum_{k = 1}^K (n_k - 1) \hat{\sigma}_k^2 \]
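A short sketch of these estimators on a tiny made-up training set (the data below are purely illustrative):

```python
import numpy as np

# Small made-up training set: feature x and class label y in {0, 1}
x = np.array([-1.2, -0.8, -1.5, 0.3, 1.9, 2.4, 1.7, 2.1])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

n, K = len(y), 2
n_k = np.array([(y == k).sum() for k in range(K)])

# Priors: sample proportions pi_hat_k = n_k / n
pi_hat = n_k / n

# Class means and class variances (with the n_k - 1 denominator)
mu_hat = np.array([x[y == k].mean() for k in range(K)])
s2_hat = np.array([x[y == k].var(ddof=1) for k in range(K)])

# Pooled variance: sigma_hat^2 = (1 / (n - K)) * sum_k (n_k - 1) * sigma_hat_k^2
sigma2_pooled = ((n_k - 1) * s2_hat).sum() / (n - K)

print(pi_hat, mu_hat, sigma2_pooled)
```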

Sources of error

Theory says that the Bayes classifier, which classifies an observation to the class for which \(p_k(x)\) is largest, has the lowest possible error rate out of all classifiers.

Caveats (sources of error):

  • we need to make distributional assumptions on \(f_k(x)\)
  • we need to estimate the parameters
  • sometimes we have knowledge of the class membership probabilities \(\pi_1, \dots, \pi_K\) which can be used directly. Otherwise, LDA estimates \(\pi_k\) using the proportion of the training observations that belong to the \(k^{th}\) class.

Linear Discriminant Analysis, \(p > 1\)

The assumption is that \(X = (X_1,X_2, \dots ,X_p)\) is drawn from a multivariate Gaussian distribution, with a class-specific mean vector and a common covariance matrix.

Discriminant function: \[\delta_k(x) = x^\intercal \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^\intercal \Sigma^{-1} \mu_k + \log \pi_k\] This can be written as: \[\delta_k(x) = a_{k0} + a_{k1} x_1 + \dots + a_{kp} x_p\] which is a linear function of \(x\).
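A minimal sketch of this multivariate discriminant score with made-up class means, a shared covariance matrix, and equal priors:

```python
import numpy as np

# Hypothetical parameters for p = 2, K = 2
mu = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
Sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])      # common covariance matrix
pi = [0.5, 0.5]

Sigma_inv = np.linalg.inv(Sigma)

def delta(x, k):
    # delta_k(x) = x^T Sigma^{-1} mu_k - 0.5 mu_k^T Sigma^{-1} mu_k + log pi_k
    return x @ Sigma_inv @ mu[k] - 0.5 * mu[k] @ Sigma_inv @ mu[k] + np.log(pi[k])

x_new = np.array([1.5, 0.2])
scores = [delta(x_new, k) for k in range(2)]
print(int(np.argmax(scores)))       # predicted class: the largest (linear) score wins
```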

Linear Discriminant Analysis, \(p > 1\)

\[f(x) = \frac{1}{(2 \pi)^{p/2} |\Sigma|^{1/2}} e^{- \frac{1}{2} (x - \mu)^\intercal \Sigma^{-1} (x - \mu)}\]

  • Given a jointly normal distribution, both the marginal distributions and the conditional distributions are normal, too!
  • The covariance matrix indicates the orientation of the elliptic shape
  • The marginal variances indicate the elongation of the elliptic shape along each axis
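For instance, a small sketch evaluating this density with a made-up mean and covariance (the off-diagonal entry is what tilts the elliptic contours):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical mean and covariance for p = 2
mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])   # marginal variances 2 and 1, positive correlation

# Evaluate the joint density f(x) at a point
print(multivariate_normal(mean=mu, cov=Sigma).pdf([1.0, 0.5]))

# The marginals of this joint normal are normal too: X_1 ~ N(0, 2), X_2 ~ N(0, 1)
```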

Linear Discriminant Analysis, \(p > 1\)

Illustration: \(p = 2\) and \(K = 3\)

Note that there are three lines representing the Bayes decision boundaries because there are three pairs of classes among the three classes.

Quadratic Discriminant Analysis

With Gaussians but different \(\Sigma_k\) in each class, i.e. \(f_k(x) = \mathcal{N}_p(\mu_k, \Sigma_k)\), we get quadratic discriminant analysis (QDA).

Naive Bayes

With \(f_k(x) = \prod_{j=1}^p f_{jk}(x_j)\) (conditional independence model) in each class we get naive Bayes. For Gaussians this means the \(\Sigma_k\) are diagonal.

  • Useful when \(p\) is large, and so multivariate methods like QDA and even LDA break down
  • Despite strong assumptions, naive Bayes often produces good classification results
  • It can be easily used with densities other than normal, e.g. \(f_{jk}(x)\) can be Bernoulli, Multinomial, Poisson, …
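A minimal sketch of Gaussian naive Bayes with made-up per-class means, per-feature standard deviations, and priors, computing the unnormalized posteriors \(\pi_k \prod_j f_{jk}(x_j)\) directly:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical parameters for p = 2 features and K = 2 classes:
# within each class the features are treated as independent (diagonal covariance)
mu = {0: np.array([0.0, 0.0]), 1: np.array([2.0, 1.0])}
sd = {0: np.array([1.0, 2.0]), 1: np.array([1.5, 0.5])}   # per-feature standard deviations
pi = {0: 0.6, 1: 0.4}

def unnormalized_posterior(x, k):
    # pi_k * prod_j f_jk(x_j), with each f_jk a univariate normal density
    return pi[k] * np.prod(norm.pdf(x, loc=mu[k], scale=sd[k]))

x_new = np.array([1.2, 0.4])
scores = np.array([unnormalized_posterior(x_new, k) for k in (0, 1)])
print(scores / scores.sum())      # posterior probabilities
print(scores.argmax())            # predicted class
```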

Similarities with Logistic regression

For a two-class problem, one can show that for LDA

\[\log \left( \dfrac{p_1(x)}{1 - p_1(x)} \right) = c_0 + c_1 x_1 + \dots + c_p x_p\]

so it has the same form as logistic regression.

The difference is in how the parameters are estimated:

  • Logistic regression uses the conditional likelihood based on \(\mathbb{P}(Y \mid X)\)
  • LDA uses the full likelihood based on \(\mathbb{P}(X, Y)\)

Despite these differences, in practice the results are often very similar.

Logistic regression can also fit quadratic boundaries like QDA, by explicitly including quadratic terms in the model.
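A small sketch of this idea on toy data (the data-generating rule below is made up): adding squared and interaction terms to the features lets an otherwise linear logistic model trace a curved boundary.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression

# Toy data: class 1 when a point lies far from the origin, so the true boundary is curved
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.5).astype(int)

# Squared and interaction terms: x1, x2, x1^2, x1*x2, x2^2
X_quad = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
clf = LogisticRegression(max_iter=1000).fit(X_quad, y)
print(clf.score(X_quad, y))    # training accuracy; close to 1 on this toy example
```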

To sum up

  • LDA: \(X \mid Y = k \sim \mathcal{N}_p (\mu_k, \Sigma)\)
  • QDA: \(X \mid Y = k \sim \mathcal{N}_p (\mu_k, \Sigma_k)\)
  • Naive Bayes: \(X \mid Y = k \sim \mathcal{N}_p (\mu_k, \mathrm{diag}(\sigma_{k1}^2, \dots, \sigma_{kp}^2))\)

where \(\mathcal{N}_p(\mu, \Sigma)\) is the multivariate normal distribution.

k Nearest Neighbours

k-NN in practice

Given a positive integer \(k\) and a test observation \(x\), the k-NN classifier:

  1. identifies the \(k\) points in the training data that are closest to \(x\), represented by \(\mathcal{N}_k(x)\)
  2. estimates the conditional probability for class \(j\) as the fraction of points in \(\mathcal{N}_k(x)\) whose response values equal \(j\), i.e.  \[\hat{\mathbb{P}} (Y = j \mid X = x) = \frac{1}{k} \sum_{i \in \mathcal{N}_k(x)} \mathcal{I} (y_i = j)\] (see the sketch below)
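A minimal sketch of the k-NN rule on a tiny made-up training set; knn_predict below is a hypothetical helper written for illustration, not a library function:

```python
import numpy as np

def knn_predict(x_new, X_train, y_train, k=3):
    # Step 1: identify the k training points closest to x_new (Euclidean distance)
    dist = np.linalg.norm(X_train - x_new, axis=1)
    neighbours = np.argsort(dist)[:k]
    # Step 2: estimate P(Y = j | X = x_new) as the fraction of the k neighbours in class j
    labels, counts = np.unique(y_train[neighbours], return_counts=True)
    return labels[np.argmax(counts)], dict(zip(labels, counts / k))

# Tiny made-up training set with two classes
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.2], [0.9, 1.0], [1.1, 0.8]])
y_train = np.array([0, 0, 1, 1, 1])

pred, probs = knn_predict(np.array([0.8, 0.9]), X_train, y_train, k=3)
print(pred, probs)   # predicted class and the class fractions among the k neighbours
```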

k-NN in practice

\(k = 3\)

k-NN in practice

\(k = 15\)

k-NN in practice

\(k = 60\)

Question time